feat(adf): subscription-based provider routing with persona and skill chains#705
feat(adf): subscription-based provider routing with persona and skill chains#705AlexMikhalev wants to merge 33 commits intomainfrom
Conversation
Security hardening for terraphim_rlm crate: 1. Created validation.rs module with: - validate_snapshot_name(): Prevents path traversal attacks - validate_code_input(): Enforces MAX_CODE_SIZE (1MB) limit - validate_session_id(): Validates UUID format - validate_recursion_depth(): Prevents stack overflow - Security constants: MAX_CODE_SIZE, MAX_INPUT_SIZE, MAX_RECURSION_DEPTH 2. Fixed race condition in firecracker.rs: - Changed snapshot counter from read-then-write to atomic write lock - Added validate_snapshot_name() call before snapshot creation - Prevents TOCTOU vulnerability where concurrent snapshots could exceed limit 3. Enhanced mcp_tools.rs: - Added MAX_CODE_SIZE validation for rlm_code tool - Added MAX_CODE_SIZE validation for rlm_bash tool - Returns proper MCP error format for validation failures Refs #426
Refs: PR #426 Features: - fcctl-core to terraphim_firecracker adapter - Sub-500ms VM allocation (267ms measured) - ULID-based VM ID enforcement - Full trait implementation with error preservation - 119 tests passing Validation: All acceptance criteria met
…erTier enum Add provider, fallback_provider, fallback_model, and provider_tier fields to AgentDefinition for subscription-based model routing (ADR-002, ADR-003). Add ProviderTier enum (Quick/Deep/Implementation/Oracle) with per-tier timeout values. Add opencode CLI support in spawner arg inference. All new fields are Optional with serde(default) for backward compatibility. Fixes #28 Refs #29 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add provider allowlist and banned_providers fields to OrchestratorConfig. Default banned list includes "opencode" (Zen pay-per-use proxy, ADR-002). Add validate_provider() method to spawner AgentConfig that rejects any model string starting with a banned prefix, while correctly allowing opencode-go/ (subscription) and kimi-for-coding/ (subscription). Fixes #31 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add persona_name, persona_symbol, persona_vibe, meta_cortex_connections, and skill_chain fields to AgentDefinition for the four-layer identity stack (Persona/Role/SFIA/Skills). All fields Optional/default for backward compatibility. See ADR-004. Fixes #32 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SpawnRequest struct and spawn_with_fallback() method that: - Validates banned providers before spawn (ADR-002) - Uses ProviderTier timeout values (Quick=30s, Deep=60s, Impl=120s, Oracle=300s) - Retries with fallback provider/model on primary failure - Integrates per-provider circuit breakers (3 failures = open 5 min) - Returns SpawnerError on both primary and fallback failure SpawnRequest avoids circular dependency with terraphim_orchestrator by mirroring needed AgentDefinition fields in the spawner crate. Fixes #30 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add persona_name, persona_symbol, persona_vibe, and meta_cortex_connections to SpawnRequest. The build_persona_prefix() function generates a markdown identity block that is prepended to the agent task prompt when persona fields are configured. Agents without persona config are unaffected. Fixes #33 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SkillChainRegistry with 31 terraphim-skills and 16 zestic-skills. Provides validate_chain() to verify skill chains and validate_skill_chains() on OrchestratorConfig to validate all agents. Backward compatible -- agents with empty skill_chain pass validation. Fixes #34 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add OpenCodeEvent struct with parsing for opencode run --format json output events (step_start, text, tool_use, step_finish, result). Includes text_content(), total_tokens(), parse_line(), and parse_lines() helper methods for structured output consumption. Fixes #37 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add SkillResolver in skill_resolver.rs that maps skill chain names to actual skill file paths from the terraphim-skills repository. Features: - SkillResolver with registry of terraphim skills (security-audit, code-review, session-search, local-knowledge, git-safety-guard, devops, disciplined-research, architecture, disciplined-design, requirements-traceability, testing, acceptance-testing, documentation, md-book, implementation, rust-development, visual-testing, quality-gate) - resolve_skill_chain() method that takes Vec<skill_name> and returns resolved skill metadata (name, description, applicable_to, paths, source) - Validation of skill chains with proper error handling - Comprehensive test coverage including valid resolution, missing skill errors, and empty chain handling - Exported from crate as SkillResolver, SkillSource, ResolvedSkill, SkillResolutionError Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…h Refs #36 Extend SkillResolver to support zestic-engineering-skills from zestic-ai/6d-prompts repository alongside existing terraphim-skills. Features: - Add SkillSource enum (Terraphim, Zestic) to distinguish skill origins - Initialize zestic skills registry with 12 skills: quality-oversight, responsible-ai, insight-synthesis, perspective-investigation, product-vision, wardley-mapping, business-scenario-design, rust-mastery, cross-platform, frontend, via-negativa-analysis, strategy-execution - Each resolved skill includes its source metadata - SkillChainRegistry validation accepts skills from both sources - Comprehensive tests for mixed skill chains (terraphim + zestic together) Tests added: - test_resolver_has_zestic_skills - test_resolve_zestic_skill - test_resolve_mixed_skill_chain - test_validate_mixed_skill_chain_valid - test_validate_only_zestic_skills - test_mixed_chain_with_invalid_skills - test_zestic_skill_source_in_resolved - test_all_skill_names_includes_zestic - test_zestic_skill_structure - test_resolver_custom_paths_for_zestic Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…routing Refs #39 Replace all legacy codex CLI references with opencode + subscription providers. Add persona, skill_chain, provider/model/fallback fields to all agents. New agents: compliance-watchdog, drift-detector, spec-validator, test-guardian, documentation-generator, implementation-swarm, compound-review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SSH non-interactive shells (systemd) do not include ~/.bun/bin or ~/.local/bin in PATH. Use full paths for opencode and claude binaries. Also fix TOML escape in --since quote. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Creates crates/terraphim_judge_evaluator with: - SimpleAgent struct wrapping terraphim_router for KG lookups - KgMatch struct for representing term matches - lookup_terms() method using Aho-Corasick automata - enrich_prompt() for appending KG context to judge prompts - Comprehensive tests for all functionality Also adds automation/judge/model-mapping.json configuration file.
Implements JudgeModelRouter for tier-based LLM model selection: - TierConfig struct for provider+model pairs - ModelMappingConfig for complete configuration - JudgeModelRouter with methods: * from_config(path) - load from JSON file * resolve_tier(tier) - get provider/model for a tier * resolve_profile(profile) - get tier sequence for a profile - Supports tiers: quick, deep, tiebreaker, oracle - Supports profiles: default, thorough, critical, exhaustive - Comprehensive tests for all functionality - Updates automation/judge/model-mapping.json with profile definitions
Implements JudgeAgent as a supervised agent combining: - SimpleAgent for Knowledge Graph context enrichment - JudgeModelRouter for tier-based model selection - SupervisedAgent trait from terraphim_agent_supervisor JudgeVerdict struct with: - verdict: final evaluation result (PASS/FAIL/NEEDS_REVIEW) - scores: BTreeMap of detailed category scores - judge_tier: which tier produced the verdict - judge_cli: command used for evaluation - latency_ms: evaluation duration Evaluation pipeline: 1. Load file content 2. Enrich with KG context (optional) 3. Select model tier based on profile 4. Parse verdict from CLI output (JSON or text format) Includes comprehensive tests for: - Verdict parsing and manipulation - Supervised agent lifecycle - Profile-based tier selection - System message handling
- Add stagger_delay_ms config field (default: 5000ms) - Insert stagger delay between Safety agent spawns in run() - Add random jitter (0 to stagger_delay_ms) for Core agent cron spawns - Add tests for stagger delay configuration Refs #16
- Add ReviewRequest struct with from_agent, to_agent, artifact_path, review_type - Add ReviewPair config for defining (producer, reviewer) pairs - Add review_queue: Vec<ReviewRequest> to orchestrator state - Add submit_review_request(), review_queue(), process_review_queue() methods - Add check_review_trigger() to automatically queue reviews on agent completion - Add tests for review queue operations and config loading Refs #17
- Add drift_detection module with DriftDetector and DriftReport - Load strategic goals from plans/ directory markdown files - Check drift every N ticks (configurable via drift_detection.check_interval_ticks) - Calculate drift score by comparing agent outputs against strategic goals - DriftReport includes agent, drift_score, and explanation - Log warnings when drift_score exceeds threshold (default: 0.6) - Add DriftDetectionConfig to OrchestratorConfig - Add tests for drift detection functionality Refs #18
- Add session_rotation module with SessionRotationManager and AgentSession - Add SessionRotationConfig with max_sessions_before_rotation (default: 10) - Track completions_since_rotation and completed_sessions per agent - on_agent_completion() records completion and triggers rotation at threshold - Rotation creates new session ID and clears accumulated context - Add comprehensive tests for session rotation Refs #19
- Add convergence_detector module with ConvergenceDetector and ConvergenceSignal - Add ConvergenceConfig with threshold (default: 0.95) and consecutive_threshold (default: 3) - Calculate output similarity using Jaccard index on word sets - Detect convergence after N consecutive similar outputs - ConvergenceSignal includes agent, similarity, and consecutive_count - Reset on divergence to handle changing outputs - Add comprehensive tests for convergence detection Refs #20
…iven mode Issue #8: Extend orchestrator config for issue-driven mode - Add WorkflowConfig with mode (time_only/issue_only/dual), poll_interval_secs, max_concurrent_tasks - Add TrackerConfig with tracker_type (gitea/linear), url, token_env_var, owner, repo - Add ConcurrencyConfig with max_parallel_agents, queue_depth, starvation_timeout_secs - All optional in orchestrator.toml with backward compatible defaults (time_only mode is default) - Tests: parse config with and without workflow section, defaults applied Issue #9: Unified dispatcher - Create crates/terraphim_orchestrator/src/dispatcher.rs - DispatchTask enum: TimeTask(agent_name, schedule), IssueTask(agent_name, issue_id, priority) - DispatchQueue: priority queue backed by BinaryHeap with round-robin fairness - ConcurrencyController: semaphore-based with starvation timeout - Methods: submit(task), next() -> Option<DispatchTask>, active_count(), is_full() - Fairness: alternates between time and issue tasks at equal priority - Tests: submit/dequeue, priority ordering, concurrency limits, fairness Issue #10: Issue mode controller - Create crates/terraphim_orchestrator/src/issue_mode.rs - IssueMode struct using terraphim_tracker::GiteaTracker to poll for issues - Poll loop every poll_interval_secs fetching ready issues via PageRank sorting - Filter out blocked issues and already-running tasks - Map issues to agents based on labels ([ADF] -> implementation-swarm) or title patterns - Submit IssueTask to DispatchQueue - Tests: poll cycle, issue-to-agent mapping, priority calculation, blocked issue filtering Issue #11: Time mode refactor - Refactor crates/terraphim_orchestrator/src/scheduler.rs - Extract TimeMode struct wrapping existing cron scheduler - TimeMode submits TimeTask to DispatchQueue instead of spawning directly (when configured) - Maintain backward compatibility: legacy mode spawns directly if no WorkflowConfig - Tests: TimeMode submits to queue, legacy mode still works Refs #8 #9 #10 #11
- Add ModeCoordinator that manages TimeMode and IssueMode - Implement unified shutdown: signal modes, drain queue, wait for active tasks - Extend spawner integration to dispatch tasks from queue to agents - Add stall detection: log warning when queue exceeds threshold - Add comprehensive tests for dual mode, shutdown coordination, and stall detection Refs #12
Create comprehensive E2E test suite in tests/e2e_tests.rs: - test_dual_mode_operation: verify both time and issue tasks processed - test_time_mode_only: legacy config compatibility - test_issue_mode_only: issue-only config verification - test_fairness_under_load: no starvation between task types - test_graceful_shutdown: clean termination with queue draining - test_stall_detection: warning logged when queue exceeds threshold - Additional tests for concurrency limits, prioritization, and backward compatibility Uses mock tracker and avoids real API calls. Refs #13
- Create src/compat.rs: Symphony compatibility layer with type aliases, adapters - Add migration helpers for enabling dual mode from legacy configs - Create MIGRATION.md with comprehensive migration guide - Update CLAUDE.md with dual mode architecture description - Include backward compatibility notes and configuration examples - Export compat module from lib.rs Refs #14
- Run cargo test --workspace (orchestrator tests pass) - Fix all warnings in orchestrator crate - Create CHANGELOG.md entry for v1.9.0 release - Verify backward compatibility with test_backward_compatibility - Build release binary: cargo build --release -p terraphim_orchestrator - All 142 tests pass in orchestrator crate Release includes: - Dual mode orchestrator with TimeMode and IssueMode - Unified dispatch queue with fairness - Stall detection and graceful shutdown - Symphony compatibility layer - Comprehensive E2E test suite - Migration documentation Refs #15
- Fix needless borrows for generic args in workspace crate - Apply consistent code formatting across modified files
AlexMikhalev
left a comment
There was a problem hiding this comment.
PR #705 Review: Subscription-Based Provider Routing
Reviewer: Automated code review
Verdict: Changes requested -- several issues need addressing before merge
File Hygiene (HIGH)
117 of 205 changed files are NOT relevant to this PR:
- 101 files:
terraphim_server/dist/assets/bulmaswatch/-- CSS theme source files that should not be in version control (or at minimum should not be in this feature PR) - 5 files:
terraphim_server/dist/assets/*.js-- compiled frontend bundles - 8 files:
reports/upstream-sync-*.md,reports/coordination-*.md,reports/security-*.md-- ephemeral agent-generated reports - 3 files: Root-level
upstream-sync-*.md,coordination-*.md-- duplicates of the above - 1 file:
.deployment-marker-- deployment state file - 1 file:
~/.config/terraphim/sessions/telegram_287867183.jsonl-- personal session data leaked into PR
Action: Remove all unrelated files from this PR. The dist/ directory should be in .gitignore. The reports belong in a separate artifact store or CI output.
Architecture (MEDIUM)
Good
- Clean three-crate split:
terraphim_spawner(process management),terraphim_orchestrator(scheduling),terraphim_tracker(issue tracking) - Circuit breaker design (Closed -> Open after 3 failures -> Half-Open after 5min) is sound
- Four-tier provider routing (Quick/Deep/Implementation/Oracle) with fallback chains is well-structured
- ADRs are clear and well-motivated
Issues
-
Duplicated
ProviderTierenum acrossterraphim_spawnerandterraphim_orchestrator. Extract toterraphim_typesto prevent drift. -
SkillResolverhardcodes all 30 skills (18 terraphim + 12 zestic) as string literals in Rust code. Adding a skill requires recompile. The filesystem paths are configured but not used for discovery. -
Circuit breaker state is in-memory only -- resets on process restart. Acceptable for now but document the trade-off.
Security (CRITICAL + MEDIUM)
CRITICAL: PermittedProviderFilter uses substring matching
The filter uses terraphim_automata::find_matches() which performs Aho-Corasick substring matching. A provider string like "not-opencode-go" would match "opencode-go" in the thesaurus and pass validation. A malicious provider name containing a permitted substring would bypass the guard.
Fix: Switch to exact match via HashSet::contains on provider names, or validate that the match spans the entire input string. Automata-based substring matching is the wrong tool for an allowlist check.
MEDIUM: Path traversal in sfia_metaprompt
At lib.rs:744, sfia_metaprompt is joined to working_dir and read from disk without sanitization:
let full_path = working_dir.join(metaprompt_path);
std::fs::read_to_string(&full_path)If config ever comes from an untrusted source, ../../etc/passwd would read arbitrary files. Add path canonicalization and verify the resolved path stays within working_dir.
Code Quality (HIGH + MEDIUM)
HIGH: Mutex::unwrap() in production code (health.rs)
Lines 335, 341, 348, 354, 372 all use .lock().unwrap(). A poisoned mutex will panic the health check loop. Use .lock().unwrap_or_else(|e| e.into_inner()) or switch to parking_lot::Mutex (no poisoning).
MEDIUM: Dead mpsc channel in OutputCapture
At output.rs:47:
let (event_sender, _event_receiver) = mpsc::channel(100);The receiver is immediately dropped. All subsequent event_sender.send() calls silently fail. Either remove the channel entirely (broadcast is the real delivery) or wire up the receiver.
MEDIUM: Thesaurus cloned on every is_permitted() call
At lib.rs:136, thesaurus.clone() is called for every spawn request validation. If this PR switches to HashSet (per the security fix), this becomes moot.
Testing (MEDIUM)
Good
- 92 unit + 11 integration tests is solid coverage
- Tests for persona injection (full/partial/none), skill chains, NDJSON parsing, subscription guard
- Using
echo/catas mock agent binaries is a pragmatic approach
Missing
-
No half-open/recovery test for
CircuitBreaker-- only Closed->Open is tested. The cooldown timer expiry and success-in-half-open paths are untested. -
claude_session_tests.rsre-implements types locally instead of importing from the crate. If production parsing diverges from these local structs, tests will pass while production breaks. -
Weak assertion in
test_parse_malformed_json_returns_none:
result.is_none() || input.is_empty()The || input.is_empty() clause makes the assertion trivially true for empty strings.
Config (MEDIUM)
orchestrator.toml has hardcoded absolute paths
Multiple references to /home/alex/terraphim-ai, /home/alex/.local/bin/claude, /home/alex/.bun/bin/opencode, /opt/ai-dark-factory/. These only work on one machine.
Fix: Use environment variable substitution ($HOME, $TERRAPHIM_DIR) or make paths relative to working_dir.
Dependencies (LOW)
- New deps (
nix,regex,chrono,uuid,cron) are all justified terraphim_automatais heavy for what is essentially a set membership check inPermittedProviderFilter-- if switching toHashSet, this dep can be removed from spawner
Summary of Required Actions
| Priority | Issue | Action |
|---|---|---|
| CRITICAL | Substring matching in provider filter | Switch to exact match |
| HIGH | 117 unrelated files in PR | Remove from branch, update .gitignore |
| HIGH | Mutex::unwrap() in health.rs |
Use poison-safe locking |
| MEDIUM | Dead mpsc channel in OutputCapture |
Remove or wire up |
| MEDIUM | Duplicated ProviderTier enum |
Extract to terraphim_types |
| MEDIUM | Path traversal in sfia_metaprompt |
Canonicalize and validate |
| MEDIUM | No circuit breaker half-open test | Add recovery test |
| MEDIUM | Hardcoded paths in orchestrator.toml | Use env vars or relative paths |
| LOW | Hardcoded skill registry | Load from filesystem |
Good foundational work on the provider routing and persona system. The architecture is sound and the ADRs are well-written. Addressing the security issue with substring matching and cleaning up the file hygiene would make this ready for merge.
Summary
opencode/Zen prefix at runtime (Fixes Fixed APIs for create article from server and search #31)Stats
ADRs
Test plan
cargo test -p terraphim_spawner-- 92 unit tests passcargo clippy -p terraphim_spawner-- clean